Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods

Authors

  • Kimmo Kettunen
  • Timo Honkela
  • Krister Lindén
  • Pekka Kauppinen
  • Tuula Pääkkönen
  • Jukka Kervinen
Abstract

In this paper, we study how to analyze and improve the quality of a large historical newspaper collection. The National Library of Finland has digitized millions of newspaper pages. The quality of the OCR output is limited, especially for the oldest parts of the collection. Approaches such as crowdsourcing have been used in this field to improve the quality of the texts, but in this case the volume of the material makes it impossible to manually edit any substantial proportion of the texts. Therefore, we experiment with quality evaluation and improvement methods based on corpus statistics, language technology and machine learning in order to find ways to automate the analysis and improvement process. The final objective is a clear reduction in the human effort needed for post-processing the texts. We present quantitative evaluations of the current quality of the corpus, describe challenges related to texts written in a morphologically complex language, and describe two different approaches to achieving quality improvements.

Similar resources

Improving Stock Return Forecasting by Deep Learning Algorithm

Improving return forecasting is very important for both investors and researchers in financial markets. In this study we pursue this objective with two new methods. First, instead of traditional variables, gold prices are used as the predictor and the results are compared with those from Goyal's variables. Second, unlike previous research, a new machine learning algorithm called Deep learning (DP) has bee...

Corpus based coreference resolution for Farsi text

"Coreference resolution" or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreference when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

The Use of Technology in English Language Learning: A Literature Review

The use of technology has become an important part of the learning process in and out of the class. Every language class usually uses some form of technology. Technology has been used to both help and improve language learning. Technology enables teachers to adapt classroom activities, thus enhancing the language learning process. Technology continues to grow in importance as a tool to help tea...

Coreference resolution with deep learning in the Persian Language

Coreference resolution is an advanced issue in natural language processing. Nowadays, due to the extension of social networks, TV channels, news agencies, the Internet, etc. in human life, reading all the contents, analyzing them, and finding a relation between them require time and cost. In the present era, text analysis is performed using various natural language processing techniques, one ...

Enrichment of English Language Curriculum with Assistive Technology Approach and its Impact on learning of Students with Physical-Motor Impairments: A New Strategic Approach to Inclusive Education

Introduction: The use of new technologies in education is one of the topics that has attracted the attention of educational experts in the past two decades. The purpose of this study was to investigate the effect of Instructional Model enriched with Assistive Technology on learning of Students with Physical-Motor Impairments in English classes. Methods: The research method is semi-experimental ...

Publication date: 2014